Enhancing Performance in Single Processor Computing
Exploring how parallel processing techniques can be implemented in single processor systems to improve performance and efficiency
In the realm of modern computing, the pursuit of enhanced performance and efficiency has led to significant advancements in parallel processing techniques. Parallelism, a key concept in computing, involves executing multiple processes or tasks simultaneously to optimize computational speed and resource utilization.
Tasks are executed sequentially, one after the other
Multiple tasks executed simultaneously to optimize speed
Single processor with parallel capabilities
Multiple processors working together
Multiple cores on a single processor chip
System's ability to handle increasing workloads by adding more resources
Effective utilization of resources without overloading any single component
A typical single-processor computer has three major parts: a central processing unit (CPU), main memory, and input/output (I/O) devices.
Main controller with sixteen 32-bit general-purpose registers, one of which serves as the program counter
Special register holding information about the current state of the processor and the running program
Arithmetic logic unit with optional floating point accelerator and local cache memory
Operator interface connected to floppy disk
The CPU, main memory, and I/O devices all connect to a common bus called the synchronous backplane interconnect. Through this bus, all I/O devices can communicate with each other, the CPU, or memory. Peripheral storage and I/O devices can connect directly to the bus through a controller.
Parallel computing is a method by which a computer system executes multiple instructions at the same time, classically by allocating each task to a different processor. In a uniprocessor system, this capability is achieved through techniques such as using multiple cores within a single processor chip, dividing a job into smaller sub-tasks that can be processed concurrently, or leveraging specialized hardware or software to coordinate parallel processing.
Technique that allows a processor to execute a set of instructions simultaneously by dividing the instruction execution process into several stages
Method that permits a single processor to run multiple tasks at the same time by dividing the processor's time into short intervals
Pipelining allows a processor to carry out multiple instructions at the same time by dividing the execution process into several stages. Each stage in the pipeline operates on a different instruction concurrently, allowing one instruction to be fetched from memory while another is being executed.
This parallelism enhances the throughput of the processor and improves performance.
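The throughput gain from overlapping stages can be sketched with a simple cycle count. This is an idealized model that assumes a 5-stage pipeline (fetch, decode, operand fetch, execute, store) with no stalls or hazards, not a model of any real CPU:

```python
# Idealized cycle counts for sequential vs. pipelined execution.
# Assumes a 5-stage pipeline with no stalls -- an illustration only.

STAGES = 5

def sequential_cycles(n_instructions: int) -> int:
    # Without pipelining, each instruction occupies all stages in turn.
    return n_instructions * STAGES

def pipelined_cycles(n_instructions: int) -> int:
    # With pipelining, once the first instruction fills the pipe,
    # one instruction completes every cycle.
    return STAGES + (n_instructions - 1)

n = 100
print(sequential_cycles(n))  # 500
print(pipelined_cycles(n))   # 104
```

For 100 instructions the pipelined machine needs roughly a fifth of the cycles, which is where the claimed throughput improvement comes from.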
Multitasking works by dividing the processor's time into short intervals and rapidly switching between tasks. Each task is allocated a particular time slot in which to execute. Although the processor executes only one task at a time, this rapid switching creates the illusion of parallel processing.
These methods enhance the performance of a single processor. However, as the number of tasks or instructions running at the same time grows, performance eventually degrades. At that point, a multiprocessor is required to boost performance for highly parallel workloads.
Improves the performance of a uniprocessor by allowing it to execute multiple tasks or instructions simultaneously. This is achieved by increasing throughput, which reduces the time required to complete a particular task.
Parallelism in a uniprocessor is cost-effective for applications that do not require the performance of a multiprocessing system. A uniprocessor with parallelism often costs less than a multiprocessing system.
A uniprocessor consumes less power than a multiprocessor system, which makes it suitable for mobile and battery-powered devices.
Modern smartphones use single-chip processors with multiple cores (e.g., octa-core processors). These chips implement parallelism through pipelining and multitasking to provide a smooth user experience while keeping power consumption low for battery life.
6-core processor with 2 performance cores and 4 efficiency cores, using pipelining for parallel execution
8-core processor with advanced pipelining and multitasking capabilities for Android devices
Parallelism is achieved in only a limited way, and as the number of tasks or instructions executed simultaneously increases, performance decreases. This makes the approach unsuitable for applications that require high levels of parallelism.
It has limited processing power compared to a multiprocessing system, so it is not suitable for applications that require high computational power, such as scientific simulations and large-scale data processing.
Implementing parallelism in a uniprocessor can be complex, as it requires careful design and optimization to ensure that the system operates correctly and efficiently. This increases the development and maintenance costs of the system.
While gaming laptops with high-end uniprocessor chips can handle most games well, they struggle with extremely demanding tasks like real-time ray tracing or complex physics simulations at high settings. These tasks require the massive parallel processing power of dedicated GPUs or multi-processor systems.
Runs well on high-end uniprocessor systems but requires GPU acceleration for advanced ray tracing features
Complex simulations require multi-processor systems for real-time processing
It increases performance in multimedia applications such as video and audio playback, image processing, and 3D graphics rendering.
Helps web servers handle multiple requests simultaneously, which makes them more responsive and reliable.
It improves performance in artificial intelligence and machine learning applications allowing them to process large amounts of data more quickly.
Parallelism accelerates scientific simulations such as weather forecasting, fluid dynamics, and molecular modeling.
Parallelism in uniprocessors is used to improve the performance of database management systems by allowing them to handle large volumes of data more efficiently.
Applications like Adobe Premiere Pro use uniprocessor parallelism for real-time video preview and rendering
Unity and Unreal Engine utilize pipelining in uniprocessors for smooth game performance
Chrome and Firefox use multitasking to handle multiple tabs and web processes simultaneously
In earlier computers, the central processing unit (CPU) had just one arithmetic logic unit that could only carry out one function at a time. This slowed down the execution of long sequences of arithmetic instructions. To improve this, the number of functional units in the CPU was increased so that parallel and simultaneous arithmetic operations could be performed.
The CDC-6600 computer has ten different functional units built into its central processing unit:
Handles fixed-point addition operations
Handles fixed-point multiplication operations
Handles fixed-point division operations
Handles floating-point addition operations
Handles floating-point multiplication operations
Handles floating-point division operations
Handles increment operations
Handles shift operations
Handles boolean operations
Handles branch operations
These ten units work independently and can run at the same time. A scoreboard keeps track of which functional units and registers are available. With 10 functional units and 24 registers, the instruction issue rate can be greatly increased.
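The scoreboard's core job is bookkeeping: an instruction may issue only if its functional unit is free and its destination register has no pending write. A minimal sketch of that issue check, loosely modeled on the CDC-6600 (the unit and register names below are illustrative, not the machine's actual encoding):

```python
# Minimal sketch of scoreboard-style issue logic: an instruction issues
# only when its functional unit is free and its destination register has
# no pending write. Unit/register names are illustrative.

class Scoreboard:
    def __init__(self, units):
        self.units = set(units)
        self.busy_units = set()      # functional units with work in flight
        self.pending_regs = set()    # registers awaiting a result

    def can_issue(self, unit, dest_reg):
        return unit not in self.busy_units and dest_reg not in self.pending_regs

    def issue(self, unit, dest_reg):
        if not self.can_issue(unit, dest_reg):
            return False             # stall: structural or data hazard
        self.busy_units.add(unit)
        self.pending_regs.add(dest_reg)
        return True

    def complete(self, unit, dest_reg):
        self.busy_units.discard(unit)
        self.pending_regs.discard(dest_reg)

sb = Scoreboard(["float_add", "float_mul"])
print(sb.issue("float_add", "X1"))  # True: unit free, X1 not pending
print(sb.issue("float_mul", "X1"))  # False: X1 already has a pending write
print(sb.issue("float_mul", "X2"))  # True: different unit and register
```

With many independent units and registers tracked this way, several instructions can be in flight at once, which is what raises the issue rate.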
Another great example of a multifunction uniprocessor is the IBM 360/91. It has two parallel execution units: one for integer arithmetic and one for floating point arithmetic. The floating point unit has two functional units inside it - one for float add/subtract and one for float multiply/divide. The IBM 360/91 is a highly pipelined, multifunction scientific processor.
Parallel adders using techniques such as carry-lookahead and carry-save are now built into almost all arithmetic logic units, in contrast to the bit-serial adders used in early computers. Techniques like high-speed multiplier recoding and convergent division allow parallel processing and sharing of hardware components for multiply and divide operations.
The execution of instructions is now divided into multiple pipeline stages, including fetching the instruction, decoding it, fetching operands, executing the arithmetic logic, and storing the result. To allow overlapped execution of instructions through the pipeline, techniques like instruction prefetching and data buffering have been developed.
The input/output (I/O) operations can be carried out at the same time as the CPU computations through the use of separate I/O controllers, channels, or I/O processors. A direct memory access (DMA) channel enables direct transfer of information between the I/O devices and main memory. DMA operates by cycle stealing, which is transparent to the CPU. Additionally, I/O multiprocessing such as utilizing I/O processors in the CDC-6600 can accelerate data transfer between the CPU and external devices.
Allows I/O devices to transfer data directly to/from memory without CPU intervention
DMA uses bus cycles when CPU is not using them, making it transparent to CPU operations
Specialized processors that handle I/O operations independently of main CPU
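Cycle stealing can be illustrated with a toy bus model: on each bus cycle the CPU has priority, and the DMA controller moves a word only on cycles the CPU leaves idle. The CPU usage pattern below is an arbitrary illustration, not measured behavior:

```python
# Toy model of DMA cycle stealing: the DMA controller transfers a word
# only on bus cycles the CPU is not using, so the CPU is never delayed.

def run_bus(cpu_needs_bus, dma_words_pending):
    # cpu_needs_bus: per-cycle flags, True when the CPU occupies the bus
    transferred = 0
    for cpu_busy in cpu_needs_bus:
        if not cpu_busy and transferred < dma_words_pending:
            transferred += 1  # DMA "steals" this idle cycle
    return transferred

pattern = [True, False, True, True, False, False, True, False]
print(run_bus(pattern, dma_words_pending=3))  # 3 words moved, CPU undisturbed
```

Because transfers happen only on idle cycles, the stealing is transparent to the CPU, as described above.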
The CPU is far faster than main memory access. A hierarchical memory system can be used to close this speed gap.
The most internal level is the register files that can be directly accessed by the ALU. The cache memory can function as a buffer between the CPU and main memory. Block access of main memory can be accomplished through multiway interleaving across parallel memory modules.
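Low-order interleaving maps consecutive word addresses to different modules, so a block fetch can overlap the cycles of several modules. A sketch of that address mapping, assuming m = 4 modules (the module count is an assumption for illustration):

```python
# Sketch of low-order memory interleaving: with M parallel modules,
# consecutive word addresses land in different modules, so a block of
# words can be fetched with overlapped module cycles. M = 4 is assumed.

M = 4  # number of interleaved memory modules (assumed)

def module_of(addr: int) -> int:
    return addr % M          # low-order bits select the module

def offset_in_module(addr: int) -> int:
    return addr // M         # remaining bits index within the module

block = list(range(8))       # 8 consecutive word addresses
print([module_of(a) for a in block])  # [0, 1, 2, 3, 0, 1, 2, 3]
```

Since the 8-word block touches each module twice in rotation, up to 4 accesses can proceed concurrently.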
In general, the CPU is the fastest unit in the computer, with a processor cycle time tp of tens of nanoseconds. Main memory has a cycle time tm of hundreds of nanoseconds, and I/O devices are the slowest, with an average access time td of a few milliseconds. It is observed that:
td > tm > tp
For example, the IBM 370/168 has td = 8 ms, tm = 360 ns, and tp = 90 ns. With these speed gaps between the subsystems, we need to match their processing bandwidths to avoid a system bottleneck.
Number of memory words that can be accessed per unit time, where W is the number of words delivered per memory cycle: Bm = W / tm
Maximum CPU computation rate (e.g., 160 megaflops in Cray-1, 12.5 million instructions per second in IBM 370/168)
Actual performance achieved: Bu ≤ Bp
Use fast cache memory between CPU and main memory with access time similar to CPU
Acts as data/instruction buffer, transferring blocks of memory words from main memory
Use communication channels with different speeds between slow I/O devices and main memory
I/O channels execute buffering and multiplexing functions to move data from multiple devices
Disk controllers or database machines can filter non-relevant data directly from tracks
Within a given time period, multiple processes may be running concurrently in a computer system. These processes compete for memory, input/output, and CPU resources. We know that some programs are CPU-intensive while others are I/O-intensive. We can execute a mix of program types to balance usage across different hardware components. Interleaving program execution is meant to enable better utilization through overlapping of I/O and CPU operations.
When a process P1 is occupied with I/O, the scheduler can switch the CPU to process P2. This allows multiple programs to make progress concurrently. When P2 finishes, the CPU can switch to P3. By interleaving I/O and CPU work in this way, CPU wait times are greatly reduced.
The interleaving of CPU and I/O operations across multiple programs is called multiprogramming.
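The switching policy can be sketched as a tiny scheduler: the CPU runs one CPU burst of a process, then hands the CPU to the next ready process while the first one waits on I/O. The process names and burst counts are illustrative:

```python
# Sketch of multiprogramming: after each CPU burst a process goes off to
# do I/O, and the CPU is given to the next ready process in the meantime.
# Burst counts are illustrative.

from collections import deque

def schedule(jobs):
    # jobs: name -> number of CPU bursts, separated by I/O waits
    ready = deque(jobs.items())
    trace = []
    while ready:
        name, bursts = ready.popleft()
        trace.append(name)            # run one CPU burst
        if bursts > 1:                # process departs for I/O, then rejoins
            ready.append((name, bursts - 1))
    return trace

print(schedule({"P1": 2, "P2": 2, "P3": 1}))
# ['P1', 'P2', 'P3', 'P1', 'P2'] -- P2 and P3 use the CPU while P1 waits on I/O
```

The trace shows the overlap directly: the CPU is never idle while some process still has a burst to run.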
Multiprogramming on a single processor involves the CPU being shared by many programs. Sometimes, a high priority program may occupy the CPU for a long time which prevents other programs from sharing it. This issue can be resolved through a method called timesharing.
Timesharing builds on multiprogramming by assigning fixed or variable time slots to multiple programs. This provides equal opportunities for all programs competing to use the CPU.
The timesharing use of the CPU by multiple programs on a single processor computer creates the concept of virtual processors. Each program behaves as if it has its own dedicated processor, even though they're all sharing the same physical CPU.
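Timesharing's quantum-based switching is essentially round-robin scheduling. A minimal sketch, where the quantum and the per-program workloads are illustrative values:

```python
# Sketch of timesharing: each program runs for one fixed time slice
# (quantum) and then goes to the back of the queue, so every program
# behaves as if it had its own slower "virtual processor".

from collections import deque

def round_robin(programs, quantum):
    # programs: name -> total time units of work needed
    queue = deque(programs.items())
    order = []
    while queue:
        name, remaining = queue.popleft()
        order.append(name)                   # run for one quantum
        remaining -= quantum
        if remaining > 0:
            queue.append((name, remaining))  # not done: back of the queue
    return order

print(round_robin({"A": 3, "B": 2, "C": 1}, quantum=1))
# ['A', 'B', 'C', 'A', 'B', 'A']
```

Unlike the multiprogramming policy, which switches only when a process blocks, the quantum forces a switch at fixed intervals, which is what gives every program a fair share of the CPU.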
Timesharing is especially effective for computer systems connected to many interactive terminals. Each user at a terminal can interact with the computer. Timesharing was first developed for single processor systems. It has also been extended to multi-processor systems.
Use time sharing to allow multiple users to interact with the system simultaneously
Early mainframes used time sharing to serve multiple terminal users
Modern web servers use time sharing concepts to handle multiple client requests
| Aspect | Multiprogramming | Time Sharing |
|---|---|---|
| Primary Goal | Maximize CPU utilization by overlapping I/O and CPU operations | Provide responsive interactive computing to multiple users |
| CPU Allocation | Based on I/O operations (CPU switches when process waits for I/O) | Based on fixed/variable time slices (quantum) |
| User Interaction | Not primarily designed for interactive use | Designed for interactive use with terminals |
| Response Time | May vary significantly depending on system load | More consistent response time for interactive users |
| Examples | Early batch processing systems | Unix, Multics, modern interactive systems |
Single processor systems can achieve parallelism through various hardware and software techniques
Include multiple functional units, pipelining, hierarchical memory, and overlapped I/O operations
Include multiprogramming and time sharing to maximize resource utilization
Bandwidth balancing between subsystems is crucial for optimal performance
Parallel processing techniques in uniprocessor systems have revolutionized computing by enabling significant performance improvements without the need for multiple processors. These techniques are fundamental to modern computing devices, from smartphones to supercomputers.
Smartphones, tablets, and laptops use uniprocessor parallelism for smooth user experience
Servers and workstations utilize these techniques for efficient resource management
Even in single-processor systems, parallelism enables complex calculations and simulations
As computing demands continue to grow, the principles of parallelism in uniprocessor systems remain relevant. However, for extremely high-performance requirements, multi-processor and multi-core systems become necessary. The future lies in hybrid approaches that combine the best of both worlds.
Combining uniprocessor parallelism techniques with multi-core designs
More sophisticated pipeline designs with deeper and wider stages
Specialized uniprocessor designs optimized for artificial intelligence workloads
Parallelism in uniprocessor systems demonstrates that even with a single processing unit, significant performance improvements can be achieved through clever hardware and software design. These techniques form the foundation of modern computing and continue to evolve to meet the ever-increasing demands for computational power.